A Survey of Duplicate And Near Duplicate Techniques

Authors

  • Rahul Mahajan
  • Rajeev Bedi
Abstract

The World Wide Web consists of more than 50 billion pages online. The advent of the World Wide Web caused a dramatic increase in the usage of the Internet. The World Wide Web is a broadcast medium where a wide range of information can be obtained at a low cost. A great deal of the Web is replicated or near-replicated content. Documents may be served in different formats, such as HTML, PDF, and plain text, for different audiences. Documents may be mirrored to avoid delays or to provide fault tolerance. The problem of finding relevant documents has become much more prominent due to the presence of duplicate data on the WWW. This redundancy in results increases the time users spend seeking the desired information within the search results, while in general most users just want to cull through tens of result pages to find new or different results. The fundamental intention of this survey paper is to present a review of the existing literature on duplicate and near-duplicate detection of general documents and web documents in web crawling.

Index Terms: Duplicate Content, De-duplication, Near Duplicate, Replicate, Search Engine, Web Crawling, Web Mining


Similar resources

A Study of Progressive Techniques for Efficient Duplicate Detection

Databases contain very large datasets in which various duplicate records are present. The duplicate records occur when data entries are stored in a uniform manner in the database, resolving the structural heterogeneity problem. Duplicate records are difficult to detect, and detecting them takes more execution time. In this literature survey papers various techniques used to find duplicate record...

A New Method for Duplicate Detection Using Hierarchical Clustering of Records

Accuracy and validity of data are prerequisites for the appropriate operation of any software system. There is always the possibility of errors occurring in data due to human and system faults. One such error is the existence of duplicate records in data sources. Duplicate records refer to the same real-world entity. There should be only one of them in a data source, but for some reasons like aggregation of ...

A Rare Case Report of duplicate Vents in a Broiler Breeder Hen (Case Report)

Malformations which occur during the development of the avian body organs can lead to structural and functional abnormalities. Most defects are recognized at hatching, but some go undetected until somewhat later. The cause of the majority of animal congenital malformations is unknown. A significant proportion of congenital malformations of unknown cause are likely to have an important genetic c...

Performance of near-duplicate detection algorithms for Crawljax

On the web, near-duplicate documents are abundant. As many as 40% of the pages on the Web are near-duplicates of other pages, according to Manning et al. [10]. A web crawler should be able to recognize and deal with near-duplicate web pages. In this survey we will first explore the most prominent duplicate-detection algorithms, which could be viable implementations in Crawljax...

Identification of MIR-Flickr Near-duplicate Images - A Benchmark Collection for Near-duplicate Detection

There are many contexts where the automated detection of near-duplicate images is important, for example the detection of copyright infringement or images of child abuse. There are many published methods for the detection of similar and near-duplicate images; however it is still uncommon for methods to be objectively compared with each other, probably because of a lack of any good framework in ...

Publication date: 2014